Use waitpid to iterate over all exited child processes #122
@tchaloupka, @ZombineDev, this is what I had in mind - do you see any issues with this approach?
Hi, I just happened to be looking at this too :) As I see your changes:
The POSIX standard just says that non-realtime signals may generally get coalesced, and this also applies to
True, I was worried about the same, although we of course already had the same issue with just relying on
It should return
Damn, good to know. For reference, I found it described e.g. here: https://ldpreload.com/blog/signalfd-is-useless
Since signals get coalesced, there probably isn't a better way to handle that.
Damn, I just didn't notice the

Regarding the change, I tried to add a test case for this and ended up with something like:

```d
#!/usr/bin/env dub
/+ dub.sdl:
    name "test"
    dependency "eventcore" path=".."
+/
module test;

import core.sys.posix.sys.wait : waitpid, WNOHANG;
import core.time : Duration, msecs;
import eventcore.core;
import std.process : thisProcessID;
import std.stdio;

int numProc; // children whose exit has not been reported yet

void main(string[] args)
{
    if (args.length == 2)
    {
        // Child mode: print a greeting, stay alive briefly, then exit.
        import core.thread : Thread;
        writeln("Child: ", args[1], " from ", thisProcessID);
        Thread.sleep(100.msecs);
    }
    else {
        // Parent mode: spawn ten children and wait for all of them.
        ProcessID[] procs;
        foreach (_; 0..10) {
            auto p = eventDriver.processes.spawn(
                ["./test", "hello"],
                ProcessStdinFile(ProcessRedirect.inherit),
                ProcessStdoutFile(ProcessRedirect.inherit),
                ProcessStderrFile(ProcessRedirect.inherit),
                null, ProcessConfig.none, null
            );
            assert(p != Process.init);
            numProc++;
            procs ~= p.pid;
            auto wres = eventDriver.processes.wait(p.pid, (ProcessID pid, int res) nothrow
            {
                numProc--;
                try writefln("Child %s exited with %s", pid, res);
                catch(Exception){}
            });
            // A zero wait ID is treated as "already exited" here.
            if (wres == 0) numProc--;
            writeln("Started child: ", p.pid);
        }

        // Run the event loop until every exit callback has fired.
        do eventDriver.core.processEvents(Duration.max);
        while (numProc);

        // All children must already be reaped, so waitpid has to fail.
        foreach (p; procs) assert(waitpid(cast(int)p, null, WNOHANG) == -1);
    }
}
```

It sometimes hangs indefinitely with my implementation, probably because the signals get coalesced.
Sorry, I should've added that in my local version I've also modified the return value of
0a6031e to 2d44090
Thanks, I modified the test to reliably fail and also fixed the
Great, thanks!
After a lot of digging through the thousand puzzle pieces of the POSIX API, it became clear that using
So instead of
Performance-wise, this is all pretty unfortunate, but it may be possible to make some advances in that regard later on.
Forgot to CC @BenjaminSchaaf
919c787 to 7689219
…eady exited. Avoids overlap with valid wait IDs, so that a paired cancelWait() doesn't cancel a different wait.
Instead of using waitpid(-1), explicitly waits on all known processes. This is inefficient for large numbers of child processes, but seems to be the only way to avoid interfering with other code that uses waitpid().
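As a rough illustration of what "explicitly waits on all known processes" can look like, here is a minimal sketch that polls each tracked PID with `waitpid(pid, ..., WNOHANG)` and never calls `waitpid(-1)`. The helper name `reapKnown`, the PID bookkeeping and the exit code mapping are assumptions made for this example, not the code from this PR.

```d
import core.sys.posix.sys.types : pid_t;
import core.sys.posix.sys.wait : waitpid, WNOHANG, WIFEXITED, WEXITSTATUS, WTERMSIG;
import core.sys.posix.unistd : fork, _exit;
import core.thread : Thread;
import core.time : msecs;
import std.stdio : writefln;

/// Hypothetical helper: reaps only the PIDs listed in `known` and returns
/// the exit codes of those that have finished. Children started by other
/// code are never touched, because waitpid(-1) is not used.
int[pid_t] reapKnown(ref pid_t[] known)
{
    int[pid_t] exited;
    pid_t[] stillRunning;
    foreach (pid; known) {
        int status;
        auto ret = waitpid(pid, &status, WNOHANG);
        if (ret == pid) // exited and reaped
            exited[pid] = WIFEXITED(status) ? WEXITSTATUS(status) : -WTERMSIG(status);
        else if (ret == 0) // still running, check again later
            stillRunning ~= pid;
        // ret == -1: not a child of ours (anymore), so it is dropped
    }
    known = stillRunning;
    return exited;
}

void main()
{
    pid_t[] known;
    foreach (code; 0 .. 3) {
        auto pid = fork();
        assert(pid >= 0, "fork failed");
        if (pid == 0) _exit(code); // child: exit immediately with a distinct code
        known ~= pid;
    }
    while (known.length) {
        foreach (pid, code; reapKnown(known))
            writefln("child %s exited with %s", pid, code);
        Thread.sleep(10.msecs);
    }
}
```

The cost is one system call per tracked process on every pass, which is the inefficiency the commit message mentions.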
It turns out that in a heterogeneous process where other parts of the code may start processes or threads and may be waiting for those to finish, it is not realistic to rely on signalfd or even SIGCHLD in general to get notified about child process exits. The only solid way appears to be to start a separate waiter thread that uses waitid/waitpid to wait for exited child processes in a blocking way. This also fixes the hanging vibe.core.process test in vibe-core with DMD 2.087.x.
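The waiter-thread idea can be sketched in a heavily simplified form: a dedicated thread blocks in `waitpid` for one known child and hands the result to a callback. The function `startWaiter` and its callback signature are hypothetical; the actual change would additionally have to track several processes and pass the result on to the event loop thread.

```d
import core.stdc.errno : errno, EINTR;
import core.sys.posix.sys.types : pid_t;
import core.sys.posix.sys.wait : waitpid, WIFEXITED, WEXITSTATUS, WTERMSIG;
import core.sys.posix.unistd : fork, _exit;
import core.thread : Thread;
import std.stdio : writefln;

/// Hypothetical helper: starts a thread that blocks in waitpid() until
/// `pid` exits and then calls `onExit` with the exit code. No SIGCHLD
/// handler or signalfd is involved.
Thread startWaiter(pid_t pid, void delegate(pid_t, int) onExit)
{
    auto t = new Thread({
        int status;
        pid_t ret;
        // Retry if the blocking call gets interrupted by a signal.
        do ret = waitpid(pid, &status, 0);
        while (ret == -1 && errno == EINTR);
        if (ret == pid)
            onExit(pid, WIFEXITED(status) ? WEXITSTATUS(status) : -WTERMSIG(status));
    });
    t.start();
    return t;
}

void main()
{
    auto pid = fork();
    assert(pid >= 0, "fork failed");
    if (pid == 0) _exit(3); // child: exit immediately

    int result = -1;
    auto waiter = startWaiter(pid, (p, code) { result = code; });
    waiter.join(); // a real event loop would be notified instead of joining
    writefln("child %s exited with %s", pid, result);
}
```

Because the thread blocks on a specific PID, it cannot steal the exit status of a child that some other part of the program is waiting for.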
Integrates the contents of StaticProcesses into PosixEventDriverProcesses to fully hide it from the Windows build. It also changes lockedProcessInfo to be a non-template function, as that led to a linker error on macOS.
7689219 to 5c3afcc
BTW, if there are no objections, I'd like to merge this today and tag a new release, so that the vibe-core/vibe.d CI finally passes again on DMD 2.087.x.
@s-ludwig I plan on doing a review of this PR with some production code I have tonight (i.e. in the next couple of hours). If you don't mind waiting until then, that would be great. From a cursory glance it looks fine though.
Sure!
Certainly a much simpler and more robust approach for this than my previous code, thanks for all the improvements @s-ludwig!
In terms of performance, there is an option to use the pid returned by waitid instead of checking all processes. This could, however, introduce subtle bugs in code using std.process alongside eventcore (where we wait on a process spawned by std.process before other code can). Not sure whether that's worth the performance gains or not.
foreach (pid; allprocs) {
I might be wrong here, but couldn't a process go out of reference at this point, causing lockedProcessInfo to get a null ProcessInfo*, resulting in a segfault? We should at least add an assert into lockedProcessInfo to make sure the pointer is not null.
I think this also begs the question of what we should do if a process is not waited on, i.e. the last reference is lost before it exits. Maybe it's worth putting an assert in releaseRef to make sure the last reference is lost after the process has completed, so that zombies are easier to debug.
> I might be wrong here, but couldn't a process go out of reference at this point, causing lockedProcessInfo to get a null ProcessInfo*, resulting in a segfault? We should at least add an assert into lockedProcessInfo to make sure the pointer is not null.

There is an if (info is null) check at line 371 (onProcessExitStatic), which should catch that case, if I'm not overlooking something.

> I think this also begs the question of what we should do if a process is not waited on, ie. the last reference is lost before it exits. Maybe it's worth putting an assert in releaseRef to make sure the last reference is lost after the process has completed so that zombies are easier to debug.

I didn't notice it before starting to work on this PR, but the reference handling in general goes against the usual rules where releaseRef must be the final call to free up the associated slot. The changes required to fix this are large enough that I'd like to split this into a separate PR, though, also considering that this issue already exists in the current master version.
BTW, I think the reason why the sequence spawn -> wait -> releaseRef currently works is that the initial ref count is zero and wraps around to size_t.max after the wait is done, so that finally the releaseRef call decrements it to size_t.max - 1 without asserting. It means that currently all slots will leak and finally crash once a PID gets reused.
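The wrap-around described above is ordinary unsigned arithmetic in D. The following stand-alone sketch uses a plain `size_t` counter in place of the real slot bookkeeping, which is an assumption made for illustration, to show why the count never reaches zero again:

```d
import std.stdio : writeln;

void main()
{
    size_t refCount = 0; // initial reference count of zero

    refCount--; // first decrement (after the wait is done) wraps around
    assert(refCount == size_t.max);

    refCount--; // second decrement (the explicit release) just counts down
    assert(refCount == size_t.max - 1);

    // The count is nowhere near zero, so a "free the slot at zero" check
    // never triggers and the slot is leaked.
    assert(refCount != 0);
    writeln("final count: ", refCount);
}
```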
That was my original approach, but such bugs would be really nasty to track down, so if anything, I'd make that an opt-in behavior. I figured that for now this is okay, considering that I didn't come up with a use case that has more than maybe a few dozen child processes open.
Just saw this. Might this help with vibe-d/vibe-core#205 ?
This only affects programs that use the child process functionality which I very much doubt the "vanilla vibe.d server" from that issue is using.
Yeah, I figured it out after reading more closely. The "zombie process" thing is what triggered my interest. Sorry for the noise.
Fixes #116 and replaces #117 by extending the use of waitpid to also iterate over exited child processes in addition to avoiding zombie processes.